Abstract. Instrumental variable (IV) methods are becoming
increasingly popular as they seem to offer the only viable way to overcome the
problem of unobserved confounding in observational studies. However, 
some attention has to be paid to the details, as not all such methods 
target the same causal parameters and some rely on more restrictive 
parametric assumptions than others. We therefore discuss and contrast 
the most common IV approaches with relevance to typical applications 
in observational epidemiology. Further, we illustrate and compare the 
asymptotic bias of these IV estimators when underlying assumptions 
are violated in a numerical study. One of our conclusions is that all IV 
methods encounter problems in the presence of effect modification by 
unobserved confounders. Since this can never be ruled out for sure, we 
recommend that practical applications of IV estimators be accompa- 
nied routinely by a sensitivity analysis. 

Key words and phrases: Causal inference, instrumental variables, Mendel- 
ian randomization, relative bias, structural mean models. 



1. INTRODUCTION 

Inferring causation in observational studies is prob- 
lematic, as observed associations can often be due 
to other than causal explanations, confounding be- 
ing of special concern. Randomized controlled trials 
(RCTs), rendering all other explanations unlikely 
by design, are the accepted standard approach to 
causal inference. However, we are here interested in 
epidemiological applications where it is not always 
possible or desirable to carry out RCTs. For example,
it would be unethical or impractical to randomly
allocate individuals to exposures such as smoking, 
alcohol consumption, and complex nutritional or ex- 
ercise regimes. Furthermore, the cohort of a trial 
might not be representative of the target population 
for which health interventions are required [16, 39]. 
The standard approach to causal inference from ob- 
servational data is to assume that there is no un- 
observed confounding, that is, that a sufficient set 
of covariates has been measured. This is often im- 
plausible and has produced misleading results in the 
past, for example, regarding the effects of hormone 
replacement therapy [38, 72]. 

Methods exploiting instrumental variables provide 
an alternative solution. Suppose we are interested in 
the causal effect of some exposure (e.g., cholesterol) 
on disease (e.g., coronary heart disease), and be- 
lieve that important confounding factors are likely 
but unobservable, perhaps because they are not fully 
understood. Loosely speaking, an instrumental vari- 
able (IV) is a third (observable) variable that is pre- 
dictive of exposure, but has no direct effect on the 
disease and is independent of the unobserved confounders.
In general, it is difficult to find a variable
that can be justified as a suitable IV for any particular
problem. For randomized trials with partial
compliance, where the effect of the actual treatment 
taken is of interest, the natural IV is the randomiza- 
tion to treatment [29]; but, of course, this is not an 
option when considering exposures that cannot be 
randomized as mentioned earlier. Examples in epi- 
demiological contexts are the physician's prescrip- 
tion preference as an IV to assess drug effects [8, 55], 
cigarette price to assess the effects of smoking [41] 
or genetic variants that are associated with expo- 
sures of interest [16, 34, 39]. The latter has become 
known as Mendelian randomization and, due to the 
fact that it is currently generating a lot of interest 
in the epidemiological literature, will serve as illus- 
tration throughout (see Section 2). 

Relying only on their defining properties, IVs can 
be used to test for or bound the causal effect [2, 
4, 23, 29, 32, 58]. However, identification and hence 
point estimates of the causal effect are only obtain- 
able under additional parametric and distributional 
assumptions. Linear structural equation models, pop- 
ular in the econometrics literature [71, 74], are a 
well-studied model class that allows identification. 
Generalizations to nonlinear structural equations 
based on log-linear or probit modeling, for example 
[47, 70], are also available (see overview [13]). In- 
spired by the simplicity of the linear case, where the 
IV estimator is given as the ratio of the coefficients 
from the regressions of outcome on IV and expo- 
sure on IV, alternative methods have been put for- 
ward replacing these two linear regressions by non- 
linear ones. One such example which is popular in 
Mendelian randomization studies with binary out- 
comes is what we will call the "Wald-type" estima- 
tor. This combines odds ratios or risk ratios for the 
genotype-outcome relationship with the mean dif- 
ference in exposure given the genotype [10, 11, 16, 
35, 40, 66]. 

An important consideration when using IV meth- 
ods is the target of inference, that is, the precise 
definition of the causal parameter of interest. In our 
experience, epidemiologists are mostly interested in 
the population causal effect, that is, a comparative 
measure of subjecting everyone in a given popula- 
tion to exposure as opposed to no exposure, as would 
ideally be obtained in an RCT. However, some prominent
IV methods target causal effects within specific
subgroups. These are the effect of treatment
on the compliers [2, 33], or the effect of treatment
on the treated [32, 57, 59, 67]. The complier causal
effect is motivated by RCTs with partial compliance
and contrasts the effect of treatment versus
nontreatment for those individuals who follow their 
assignment whatever it is. In our view, the interpre- 
tation of this causal parameter is very much bound 
to the randomization scenario and we will therefore 
not consider it any further. The effect of "treatment 
on the treated" can be translated as the "effect of 
exposure on the exposed" in an epidemiological con- 
text and describes the effect of preventing those who 
would normally be exposed from becoming exposed. 
This particular subgroup effect is explicitly modeled 
by structural mean models (SMMs) [32]. 

In this paper we compare the above approaches 
with regard to their use in observational epidemiol- 
ogy and focus on issues that have recently arisen, for 
example, in Mendelian randomization applications, 
to make the discussion concrete. We formally con- 
sider the targeted causal parameters and the under- 
lying modeling assumptions of IV methods. We ar- 
gue that their assumptions should be made explicit 
so that those most plausible for a given problem 
can be chosen. As models are never expected to be 
exactly true in practice, we complement the theoret- 
ical comparison by a numerical study of the possible 
bias under violations of the assumptions. The out- 
line of the paper is as follows. In Section 2 we begin 
by presenting the basic idea of IVs with the exam- 
ple of Mendelian randomization as recently applied 
to investigate the effects of alcohol consumption. We 
then introduce the main concepts of causal inference 
in Section 3, a central issue being the different no- 
tions of causal effect parameters. Section 3.2 gives 
the core conditions characterizing an instrumental 
variable. In Section 4 we present the IV models that 
we will consider, and provide general indications 
of how they interrelate. Section 5 investigates the 
performance in terms of relative asymptotic bias of 
these methods in a numerical study where we focus 
on the particular case where all variables are binary 
in order to facilitate exact evaluation of the relevant 
quantities. We conclude with a discussion of the im- 
plications, both for epidemiological applications and 
more generally. 

2. USING A GENETIC VARIANT AS AN IV 

We will relate to Mendelian randomization through- 
out the paper as a concrete application of an IV 
approach in observational epidemiology and outline 
the basic idea here using an example taken from
Chen et al. [12]. Further details, including history
and nomenclature, are provided in a recent review 
[15]. 

Alcohol consumption has been found in observa- 
tional studies to have a positive effect on coronary 
heart disease (CHD) and negative effects on liver 
cirrhosis, some cancers and mental health problems. 
These findings, however, are strongly suspected to 
be confounded by factors like diet, lifestyle and so- 
cioeconomic factors. Thus, in order to inform pub- 
lic health recommendations on alcohol intake, for 
example, it is important to verify which, if any, of 
these observed associations is in fact causal for the 
relevant health outcome. 

The connection between the ALDH2 gene and al- 
cohol consumption is well established and under- 
stood [6, 26, 42, 73]. The ALDH2*2 variant is as- 
sociated with an accumulation of acetaldehyde and 
hence with unpleasant symptoms after drinking al- 
cohol. Carriers of this variant tend to limit their al- 
cohol consumption regardless of their other lifestyle 
behaviors. Since genes are randomly assigned during 
meiosis, ALDH2*2 carriers should not differ system- 
atically from carriers of the ALDH2*1 allele in any 
other respect. In particular, there should be no as- 
sociation between the variant and the unobserved 
confounders of the various relationships between al- 
cohol consumption and above health outcomes. The 
plausibility of this assumption is strengthened by 
the fact that there is no evidence of ALDH2 associa- 
tion with typical known epidemiological confounders 
such as age, smoking, BMI, cholesterol, etc. [19]. The 
possibility that ALDH2 affects the particular disease 
of interest by any route other than through alcohol 
consumption can also be excluded from the known 
functionality of the gene. Thus, for any specific dis- 
ease, we should observe that there are more *1*1 and 
*1*2 than *2*2 genotypes among the affected indi- 
viduals if alcohol consumption is really causal for 
that disease. The meta-analysis by Chen et al. [12], 
based mainly on studies in Japanese populations, 
shows that blood pressure and risk of hypertension
are higher for *1*1 than for *2*2 homozygotes, and
are also higher for heterozygotes (*1*2) than for the
*2*2 homozygotes. As the heterozygotes tend to be 
moderate drinkers due to less pronounced adverse 
symptoms, the study concludes that even moderate 
alcohol consumption is "harmful" for blood pres- 
sure. 

The example shows how ALDH2 can be used as 
an IV to provide evidence for a causal effect of the 



exposure by establishing that the disease and the 
IV are associated: the risks of high blood pressure 
and hypertension are significantly different between 
the different genotypes. As ALDH2 is assumed to 
have no direct effect on blood pressure or hyperten- 
sion other than through alcohol consumption, the 
observed associations must be due to an effect of 
alcohol consumption on blood pressure and hyper- 
tension. Since the above assumptions define an IV, 
this reasoning only holds if we can be fairly con- 
fident that ALDH2 is a valid IV. Hence, only well- 
understood genotypes can be used as IVs. Note, this 
does not yet provide a point estimate of the causal 
effect of alcohol consumption on hypertension: it is 
merely evidence that there is such an effect. 

The number of applications of Mendelian random- 
ization is growing rapidly [10, 17, 18, 36, 42, 43, 46, 
66]; a brief overview of some recent studies is given
in Sheehan et al. [64]. Note that even when a ge- 
netic variant can be found that is associated with 
the exposure of interest, it does not automatically 
qualify as an IV. Problems could occur when there 
are different subpopulations with different allele
frequencies and different prevalences of disease, for
instance [9]. Finding a suitable genetic instrument is
thus a challenge as discussed in detail in several pa- 
pers [16, 23, 24, 39, 48, 65]. 

3. CAUSAL INFERENCE 

Epidemiologists are concerned with identifying the 
causal effect of an exposure X on a disease Y, typi- 
cally with the view to informing public health inter- 
ventions. We therefore regard causal inference to be 
about the effect of intervening in, or manipulating, 
a given system as is implicit in many approaches to 
causal inference [21, 24, 31, 37, 49, 57, 62, 63]. 

It is useful to introduce notation to represent in- 
tervention. Pearl [51] uses the do operator to dis- 
tinguish between conditioning on an intervention in 
$X$, $P(Y \mid \mathrm{do}(X = x))$, and the usual conditioning on
observing $X$, $P(Y \mid X = x)$. The former reflects how
the distribution of Y should be modified when X has 
been forced to the value x by some external interven- 
tion, whereas the latter reflects how the distribution 
of Y should be modified when X = x is simply ob- 
served. The different conditions, observation versus 
intervention, reflect the common wisdom that "correlation
is not causation." Note that we often write do(x) for
do(X = x). 
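
The distinction between observing and intervening can be made concrete with a small numerical sketch. The binary model below, including the unobserved confounder U and every probability in it, is a hypothetical illustration and is not taken from any study discussed here; it only shows that the two kinds of conditioning weight U differently.

```python
# Hypothetical binary model with an unobserved confounder U (all numbers are made up).
p_u = {0: 0.5, 1: 0.5}                       # P(U = u)
p_x1_given_u = {0: 0.2, 1: 0.8}              # P(X = 1 | U = u)
p_y1_given_xu = {(0, 0): 0.1, (1, 0): 0.2,   # P(Y = 1 | X = x, U = u), keyed by (x, u)
                 (0, 1): 0.5, (1, 1): 0.6}


def p_x_given_u(x, u):
    """P(X = x | U = u) for binary x."""
    return p_x1_given_u[u] if x == 1 else 1.0 - p_x1_given_u[u]


def p_y1_observe(x):
    """P(Y = 1 | X = x): U is weighted by its distribution among those observed with X = x."""
    num = sum(p_y1_given_xu[(x, u)] * p_x_given_u(x, u) * p_u[u] for u in p_u)
    den = sum(p_x_given_u(x, u) * p_u[u] for u in p_u)
    return num / den


def p_y1_do(x):
    """P(Y = 1 | do(X = x)): U keeps its marginal distribution under intervention."""
    return sum(p_y1_given_xu[(x, u)] * p_u[u] for u in p_u)


observed_difference = p_y1_observe(1) - p_y1_observe(0)
causal_difference = p_y1_do(1) - p_y1_do(0)
```

In this made-up model the observed risk difference is considerably larger than the interventional one, because individuals observed with X = 1 are more likely to have U = 1.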

Another formal approach is based on counterfactual
(potential outcome) variables [31, 62, 63]. Here
$Y(x_1)$ denotes the value that the outcome $Y$ would
have if the variable $X$ were set to the value $x_1$,
whereas $Y(x_2)$ is the outcome if the same variable
$X$ were set to the value $x_2$. The variables $Y(x_1)$
and $Y(x_2)$ are counterfactual because they can never
both be observed together, so when one is fact, the 
other one is, of necessity, contrary to fact. The no- 
tion of intervention also underlies the counterfactual 
approach [31, 60, 62, 63]. Both approaches define 
a formal language for causality and provide spe- 
cific mathematical notation for representing inter- 
ventions that we might be interested in. Hence, they 
force us to be clear and explicit about any assump- 
tions underlying a given method of causal inference. 

3.1 Causal Parameters 

Causal effect parameters are typically functions of 
the distribution of Y under different interventions 
in X. The most popular is the average causal effect 
(ACE) defined as the expected difference in Y under 
two different settings of X: 

\[
\mathrm{ACE}(x_1, x_2) := E(Y \mid \mathrm{do}(x_2)) - E(Y \mid \mathrm{do}(x_1)),
\]

where $x_1$ is typically some baseline value. The ACE
is a natural choice of causal parameter when the effect
of $X$ on $Y$ is suspected to be linear. When $Y$ is
nonnegative or binary, in contrast, it is more com- 
mon to use a multiplicative measure like the causal 
relative risk (CRR) defined as 



\[
\mathrm{CRR}(x_1, x_2) := \frac{E(Y \mid \mathrm{do}(x_2))}{E(Y \mid \mathrm{do}(x_1))}, \tag{1}
\]


or, for binary Y, the causal odds ratio (COR) given 
by 

\[
\mathrm{COR}(x_1, x_2) := \frac{P(Y = 1 \mid \mathrm{do}(x_2))\, P(Y = 0 \mid \mathrm{do}(x_1))}{P(Y = 0 \mid \mathrm{do}(x_2))\, P(Y = 1 \mid \mathrm{do}(x_1))}.
\]

Note that the odds ratio is mainly used in case- 
control studies to approximate the relative risk in 
the case of a rare disease. 
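
For a binary outcome, the three population parameters are simple functions of the two interventional risks. The sketch below assumes these risks, P(Y = 1 | do(x_1)) and P(Y = 1 | do(x_2)), are known; the function names and the numerical values are hypothetical and serve only to illustrate the definitions and the rare-disease approximation.

```python
def ace(p1, p2):
    """Average causal effect: risk under do(x2) minus risk under do(x1)."""
    return p2 - p1


def crr(p1, p2):
    """Causal relative risk: risk under do(x2) divided by risk under do(x1)."""
    return p2 / p1


def cor(p1, p2):
    """Causal odds ratio: odds under do(x2) divided by odds under do(x1)."""
    return (p2 * (1.0 - p1)) / ((1.0 - p2) * p1)


# Hypothetical interventional risks for a common outcome ...
common = {"ACE": ace(0.3, 0.4), "CRR": crr(0.3, 0.4), "COR": cor(0.3, 0.4)}
# ... and for a rare outcome, where the odds ratio is close to the relative risk.
rare = {"ACE": ace(0.001, 0.002), "CRR": crr(0.001, 0.002), "COR": cor(0.001, 0.002)}
```

With the common outcome the COR (about 1.56) is noticeably further from 1 than the CRR (about 1.33), whereas for the rare outcome the two nearly coincide.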

All these causal parameters are population parameters,
that is, they compare setting $X = x_1$ with
setting $X = x_2$ for the whole population of interest.
They are what is measured in a comparison of
the active and control groups in a controlled ran- 
domized experiment when all subjects comply with 
their treatment assignment. In some situations, we 
may be more interested in the causal effect within 
a subset of the population, that is, conditional on a 
specific value of some observed covariates. For ex- 
ample, we might want to know the average causal 



effect of male alcohol consumption on oesophageal 
cancer risk. The above causal parameters can easily 
be adapted by conditioning on covariates provided 
these are prior to exposure. We will not consider this 
further in the present paper. 

However, one particular causal subgroup effect, or 
local causal effect, is very relevant in the epidemi- 
ological literature. This is the effect of exposure on 
the exposed group [29, 57, 58], or the effect of treat- 
ment received as it is known in the context of clinical 
trials (cf. [25], e.g.). For example, we might be inter- 
ested in the effect of reducing alcohol consumption 
for those individuals who would normally tend to 
have high alcohol consumption, but not in the ques- 
tion of increasing alcohol consumption for those who 
normally do not drink much. This does not quite cor- 
respond to conditioning on observed covariates, as 
what the subjects "would normally" be exposed to 
in the future is not usually observable. However, it 
can be assumed that if no intervention takes place, 
alcohol consumption will remain high for those in- 
dividuals with existing high consumption. In coun- 
terfactual notation the corresponding local causal 
relative risk, LCRR, for instance, is given by 



\[
\mathrm{LCRR} := \frac{E(Y(x) \mid X = x)}{E(Y(0) \mid X = x)}, \tag{2}
\]



where Y(x) is the value of the outcome if an individ- 
ual's alcohol consumption is set to be x and Y(0) is 
the counterfactual outcome if it is set to be at a base- 
line level, while conditioning on X = x means that 
the "natural" alcohol consumption is x. Note that 
given X = x, we actually observe Y = Y(x), so that 
the numerator of (2) is equal to $E(Y \mid X = x)$. This
type of causal parameter can also be expressed with
the do-notation, but we need to distinguish between
the "natural" value of exposure $X$ and the one that
it is set to by intervention, $\tilde{X}$. When no intervention
takes place, these two are identical, that is, $X = \tilde{X}$.
However, when an intervention takes place, it is assumed
that $\tilde{X}$ "overrules" $X$ so that $Y$ causally depends
on $\tilde{X}$ while being still associated with $X$ due
to the fact that $X$ is informative for the unobserved
confounding that also predicts $Y$. The above can
then be translated to 



\[
\mathrm{LCRR} := \frac{E(Y \mid X = x, \mathrm{do}(\tilde{X} = x))}{E(Y \mid X = x, \mathrm{do}(\tilde{X} = 0))}. \tag{3}
\]



See Robins, VanderWeele and Richardson [61] and 
Geneletti and Dawid [28] for more details on how
to interpret this local causal effect without counterfactual
notation. Local versions of the ACE and
COR can easily be defined analogously to the above 
LCRR. Note that the term "local" causal effect in 
the IV literature is most commonly used for the 
effect of treatment on the "compliers" in an RCT
[2, 29, 33], which we are not dealing with here and 
which is different from (3). 

One further causal parameter that is sometimes 
considered is the individual causal effect which is expressed
with potential outcomes as $Y^i(x_2) - Y^i(x_1)$.
It is the difference between the potential outcomes 
for a specific individual i. Assumptions under which 
the individual causal effect can be identified are in- 
herently untestable [20], but may be justified given 
specific subject matter background knowledge. 

Finally, we want to emphasize that a population 
parameter like CRR in (1) will be different from 
a conditional or local parameter LCRR in (2) or 
from an individual causal effect when the effect of 
exposure is different in different subgroups or indi- 
viduals, that is, under heterogeneity or effect mod- 
ification. For instance, those who naturally have a 
high alcohol consumption are likely to be different in 
many other relevant but unobservable respects than 
those who have a naturally low alcohol consump- 
tion and, therefore, the effect of changing that level 
should be different in these two groups. In partic- 
ular, there may be no overall effect in the popula- 
tion (i.e., CRR = 1) if negative and positive effects 
in subgroups (or individuals) cancel each other out. 
In such a situation, an estimator that targets the 
CRR will be biased for the LCRR and vice versa. 
We reiterate that the accepted gold standard RCT 
randomizing individuals to either $x_1$ or $x_2$ always
targets a population causal effect. 
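
The difference between a population and a local parameter under effect modification can be illustrated numerically. The sketch below uses a hypothetical binary model in which the unobserved U reverses the direction of the effect of X; all probabilities are made up, and it is assumed that U captures all confounding, so that the counterfactual risk among the naturally exposed can be computed by averaging over P(U | X = 1).

```python
# Hypothetical binary model; U modifies the effect of X on Y (all numbers are made up).
p_u = {0: 0.5, 1: 0.5}                       # P(U = u)
p_x1_given_u = {0: 0.2, 1: 0.8}              # P(X = 1 | U = u)
p_y1_do = {(0, 0): 0.2, (1, 0): 0.1,         # P(Y = 1 | do(X = x), U = u), keyed by (x, u)
           (0, 1): 0.2, (1, 1): 0.4}


def crr():
    """Population CRR for x2 = 1 versus x1 = 0: U is averaged over its marginal."""
    num = sum(p_y1_do[(1, u)] * p_u[u] for u in p_u)
    den = sum(p_y1_do[(0, u)] * p_u[u] for u in p_u)
    return num / den


def lcrr():
    """LCRR for the naturally exposed (X = 1): U is averaged over P(U | X = 1)."""
    p_x1 = sum(p_x1_given_u[u] * p_u[u] for u in p_u)
    p_u_given_x1 = {u: p_x1_given_u[u] * p_u[u] / p_x1 for u in p_u}
    num = sum(p_y1_do[(1, u)] * p_u_given_x1[u] for u in p_u)
    den = sum(p_y1_do[(0, u)] * p_u_given_x1[u] for u in p_u)
    return num / den


population_effect = crr()   # 0.25 / 0.20
local_effect = lcrr()       # 0.34 / 0.20
```

Here the exposed group is dominated by individuals with U = 1, for whom exposure is harmful, so the local relative risk (1.7) exceeds the population relative risk (1.25).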

3.2 Instrumental Variables 

The standard approach to estimating a causal pa- 
rameter from observational data is to assume that 
a sufficient set of observed confounders is available 
for which we then adjust [21, 30, 37, 51, 62]. When 
there is reason to suspect additional unobserved con- 
founding, the causal effect cannot typically be ob- 
tained in this way. In this situation, IV methods 
permit a different way of performing causal infer- 
ence by exploiting the additional information pro- 
vided by the instrumental variable. 

Recall that we denote the exposure of interest (in- 
termediate phenotype or modifiable risk factor) by 
X and the outcome (disease) by Y. Furthermore, 



we let G be the instrument (e.g., genotype in a 
Mendelian randomization study) and U an unob- 
served variable (or, more realistically, a set of unob- 
served variables) that will represent the confounding 
between X and Y. The properties that define an IV 
are expressed in terms of conditional independence 
statements where $A \perp\!\!\!\perp B \mid C$ means $A$ is independent
of $B$ given $C$. The core conditions are the following:

1. $G \perp\!\!\!\perp U$, that is, $G$ must be (marginally) independent
of the confounding between $X$ and $Y$;

2. $G \not\perp\!\!\!\perp X$, that is, $G$ must not be (marginally)
independent of $X$;

3. $G \perp\!\!\!\perp Y \mid (X, U)$, that is, conditionally on $X$ and
the confounder $U$, the instrument and the response
are independent.

These properties can, to a limited extent, be tested 
from the observable data (i.e., without measurements 
on U) when G,X, Y are all categorical. This is be- 
cause they impose certain inequality constraints on 
the joint distribution p(y,x,g) (see [50, 51] for de- 
tails). Analogous constraints can also be obtained 
for situations where joint observation of (G,X,Y) 
is not possible, but separate observations on (G, X) 
and (G,Y) are available from different studies [52], 
for instance, as is often the case for Mendelian ran- 
domization applications. Furthermore, Ramsahai [53] 
develops a statistical test for violation of these in- 
equality constraints that properly accounts for the 
sampling variability in the estimated probabilities. 
When the data are categorical, these inequalities 
should always be verified in order to detect "gross" 
violations of the above core conditions. However, it 
should be kept in mind that distributions p(y, x, g, u) 
will exist which violate the core conditions but may 
have marginals p(y,x,g) that still satisfy these in- 
equalities. We are not aware of analogous inequality 
constraints that could be checked when X is con- 
tinuous (but see [5] for the case where instrument 
or outcome are continuous). Categorizing continu- 
ous variables is not advisable, as it is possible that 
the continuous variables satisfy the above core con- 
ditions, while their discrete versions do not. Hence, 
since a test of the inequalities can only falsify the 
core assumptions but never confirm them, and since 
it cannot be carried out when the exposure is con- 
tinuous, it is crucial to always justify the core condi- 
tions on the basis of subject matter or other relevant 
background knowledge. 
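
As an illustration of such a check, the sketch below implements one well-known constraint of this kind, often called the instrumental inequality: for every value x, the sum over y of the maximum over g of P(Y = y, X = x | G = g) must not exceed 1. Whether this coincides with the full set of constraints in the cited references is an assumption made here, the function name is hypothetical, and sampling variability is ignored.

```python
def instrumental_inequality_holds(p_yx_given_g, tol=1e-12):
    """Check max over x of sum_y max_g P(Y=y, X=x | G=g) <= 1.

    p_yx_given_g maps each instrument value g to a dict {(y, x): P(Y=y, X=x | G=g)}.
    Returns False if the inequality is violated for some x.
    """
    tables = list(p_yx_given_g.values())
    xs = {x for table in tables for (_, x) in table}
    ys = {y for table in tables for (y, _) in table}
    for x in xs:
        total = sum(max(table.get((y, x), 0.0) for table in tables) for y in ys)
        if total > 1.0 + tol:
            return False
    return True


# Hypothetical distribution in which G shifts Y while X mostly stays at 0.
violating = {
    0: {(0, 0): 0.9, (1, 0): 0.0, (0, 1): 0.05, (1, 1): 0.05},
    1: {(0, 0): 0.0, (1, 0): 0.9, (0, 1): 0.05, (1, 1): 0.05},
}
# Hypothetical distribution that passes the check.
compatible = {
    0: {(0, 0): 0.6, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.1},
    1: {(0, 0): 0.2, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.4},
}
```

A `False` result indicates that the core conditions cannot all hold; a `True` result, as emphasized above, does not confirm them.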

A shorthand way of encoding conditional independence
restrictions is via graphical models [14].
The directed acyclic graph (DAG) in Figure 1 is the
unique representation of the above core conditions. 
Furthermore, this graph is equivalent to a factor- 
ization of the joint density on (Y, X,U,G) in the 
following way: 

\[
p(y, x, u, g) = p(y \mid u, x)\, p(x \mid u, g)\, p(u)\, p(g). \tag{4}
\]

While this describes how the variables behave "nat- 
urally," we have to specify our assumptions about 
how an intervention in X operates on the system. 
This takes the form of an additional structural as- 
sumption which states that intervening in X does 
not affect the distributions of any other factors in 
(4) besides the conditional distribution of X. Under 
intervention on X, the joint distribution of (4) thus 
becomes 

\[
p(y, u, g, x \mid \mathrm{do}(\tilde{x})) = p(y \mid u, \tilde{x})\, I\{x = \tilde{x}\}\, p(u)\, p(g), \tag{5}
\]

where $I\{\cdot\}$ is the indicator function. The corresponding
DAG in Figure 2 graphically shows the conditional
independence relationships among $Y$, $G$ and
$U$ for the core conditions and an intervention on $X$.
One immediate implication is that $G \perp\!\!\!\perp Y \mid \mathrm{do}(X)$,
which is also known as the exclusion restriction
condition in the IV literature, where it is typically
expressed with potential outcomes as $G \perp\!\!\!\perp Y(x)$ [2,
33].
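
The factorization of the observational distribution and its interventional counterpart can be sketched for binary variables as follows. All conditional probability tables are hypothetical; the point is only to show that G and Y are associated observationally but become independent once X is set by intervention.

```python
from itertools import product

# Hypothetical conditional probability tables for binary G, U, X, Y (numbers are made up).
p_g = {0: 0.7, 1: 0.3}                                        # P(G = g)
p_u = {0: 0.5, 1: 0.5}                                        # P(U = u)
p_x1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}   # P(X = 1 | u, g), keyed by (u, g)
p_y1 = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6}   # P(Y = 1 | u, x), keyed by (u, x)


def bern(p_one, value):
    """Probability of a binary value given the probability of the value 1."""
    return p_one if value == 1 else 1.0 - p_one


def joint_obs(y, x, u, g):
    """Observational joint: p(y | u, x) p(x | u, g) p(u) p(g)."""
    return bern(p_y1[(u, x)], y) * bern(p_x1[(u, g)], x) * p_u[u] * p_g[g]


def joint_do(y, x, u, g, x_set):
    """Joint under do(X = x_set): p(x | u, g) is replaced by an indicator."""
    return bern(p_y1[(u, x_set)], y) * (1.0 if x == x_set else 0.0) * p_u[u] * p_g[g]


def p_y1_given_g_obs(g):
    """P(Y = 1 | G = g) in the observational regime."""
    return sum(joint_obs(1, x, u, g) for x, u in product((0, 1), repeat=2)) / p_g[g]


def p_y1_given_g_do(g, x_set):
    """P(Y = 1 | G = g, do(X = x_set)) in the interventional regime."""
    return sum(joint_do(1, x, u, g, x_set) for x, u in product((0, 1), repeat=2)) / p_g[g]
```

In this example `p_y1_given_g_obs` differs between g = 0 and g = 1, while `p_y1_given_g_do` returns the same value for both, which is the exclusion restriction in numerical form.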

Taking this a step further, we can also express the 
IV assumptions when the effect of exposure on the 
exposed individuals is of interest. Using the notation 
introduced in Section 3.1, let $X$ denote the "natural"
exposure level, while $\tilde{X}$ denotes the exposure
that is set by an intervention. When there is no
intervention, they are identical and (4) is valid. Under
intervention, $\tilde{X}$ overrules the "natural" $X$ with
respect to the conditional distribution of $Y$ and we
obtain the joint distribution under intervention

\[
p(y, u, g, X = x \mid \mathrm{do}(\tilde{X} = \tilde{x})) = p(y \mid u, \tilde{x})\, p(X = x \mid u, g)\, p(u)\, p(g),
\]

which can again be represented graphically with a
DAG as in Figure 3 [28, 61]. As before, we have the
exclusion restriction $Y \perp\!\!\!\perp G \mid \mathrm{do}(\tilde{X})$, but we can also
derive, for instance, that $Y$ is not independent of $G$
given $X$ and $\mathrm{do}(\tilde{X})$.

Fig. 1. The DAG representing the core conditions required
for $G$ to be an instrument (edges $G \to X$, $X \to Y$, $U \to X$ and $U \to Y$).

4. SOME COMMON IV MODELS 

With the above core conditions 1-3 and structural 
assumption of (5), the IV can be used to test for the 
presence of a causal effect, or to derive lower and 
upper bounds on causal effects for the case when 
all variables are categorical [4, 22, 44, 57]. How- 
ever, for general distributions of (X,Y,G,U), the 
core conditions alone do not necessarily allow
point-identification of causal effects, except for some
extremely unusual situations [29].

Below we present some common model restric- 
tions, that is, additional parametric assumptions, 
that enable point-identification of causal parame- 
ters. When the causal parameter is identified, it can 
be estimated consistently; in practice, small sample 
sizes can still induce problems, but we will ignore 
this issue here. 

Our terminology is as follows. Let $\theta^*$ be the true
causal parameter of interest, for example, the CRR
$\theta^* = E_{P^*}(Y \mid \mathrm{do}(x_2)) / E_{P^*}(Y \mid \mathrm{do}(x_1))$, where
expectations are taken with respect to the true
distribution $P^*$. Restrictions are imposed in the form of a
statistical model $\mathcal{M}$, which is simply a set of
distributions with some common characteristics for the
random variables of interest, for example, the
conditional mean of $Y$ being linear in $X$. The model $\mathcal{M}$ is
correctly specified if $P^* \in \mathcal{M}$. The model $\mathcal{M}$ further
allows point-identification of the true causal effect
parameter when $\theta^*$ is equal to a function $\theta_{\mathcal{M}}(P_{X,Y,G})$
that only depends on the observational (i.e., not
interventional) distribution of the observable variables.
The exact form of the function $\theta_{\mathcal{M}}$ depends on
the model assumptions, that is, on $\mathcal{M}$. If the model
$\mathcal{M}$ is misspecified, then it does not contain the true
distribution $P^*$ and $\theta_{\mathcal{M}}(P_{X,Y,G})$ will not necessarily
be equal to $\theta^*$, as the former relies on wrong
model assumptions. We call the causal parameter
of interest, $\theta^*$, the target of inference, and we call
$\theta_{\mathcal{M}}(P_{X,Y,G})$ the estimand regardless of whether the
model is correctly specified or not. Hence, the
estimand is equal to the target under a correct model
and otherwise potentially different. Note that the
intervention distribution $P^*(Y = y \mid \mathrm{do}(x))$ itself might
be of interest as a target, and, if identified, any
causal parameter can be obtained from it.

Fig. 2. The DAG representing the core conditions under
intervention in $X$.

Fig. 3. The DAG representing the core conditions under
intervention in $\tilde{X}$ and "natural" exposure $X$.

As we will see below, $\theta_{\mathcal{M}}$ can typically be
expressed in terms of conditional probabilities or
expectations with respect to the observational
distribution of $X$, $Y$ and $G$. For practical data analysis,
these have to be replaced, for instance, by the
corresponding empirical relative frequencies, averages
or regression coefficients, assuming that we have an
independent identically distributed (i.i.d.) sample of
$(X, Y, G)$; this then yields an estimator $\hat{\theta}_{\mathcal{M}}$. We will
not go into the details of the actual construction
of estimators $\hat{\theta}_{\mathcal{M}}$ as functions of the sample but
will focus on how different models $\mathcal{M}$ allow
point-identification and what the corresponding estimands
$\theta_{\mathcal{M}}(P_{X,Y,G})$ are.

Note that when parametric assumptions are made,
the core conditions can sometimes be weakened, for
example, by requiring only that G and U are
uncorrelated, but we do not discuss this further here.
Also, some, but not all, of the following approaches 
are only defined when Y and/or G are binary. This 
will be indicated when relevant. 

4.1 Linear IV Models 

The classical IV method was developed in the context
of linear models $\mathcal{M}$ which we define in more detail
below, and results in an estimator $\hat{\theta}_{\mathcal{M}}$, given as
the ratio of the least squares slope estimators from
linear regressions of $Y$ on $G$ and of $X$ on $G$. We will
call this the linear IV average effect estimator, and
its estimand $\theta_{\mathcal{M}}$ is

\[
\mathrm{LIVAE} := \frac{\mathrm{Cov}(Y, G)}{\mathrm{Cov}(X, G)}. \tag{6}
\]



The LIVAE can equivalently be estimated by obtaining
predicted values $\hat{X}$ from the regression of $X$
on $G$ and then by regressing $Y$ on $\hat{X}$. It is therefore
known as two-stage least squares [1, 71]. In the
special case of binary instrument $G$, we have

\[
\mathrm{LIVAE} = \frac{E(Y \mid G = 1) - E(Y \mid G = 0)}{E(X \mid G = 1) - E(X \mid G = 0)}. \tag{7}
\]



This is analogous to the Wald method, which was 
originally proposed to deal with the case of measure- 
ment errors in both variables X and Y [7, 69]. As 
we shall now discuss, the LIVAE identifies either the 
population, individual or local average causal effect 
(ACE, ICE or LACE), depending on the particular 
model assumptions. 
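
The three ways of computing the LIVAE described above (covariance ratio, difference-in-means ratio for a binary instrument, and two-stage least squares) can be compared on simulated data. The sketch below is only an illustration: the data-generating model, its coefficients, the sample size and all function names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Empirical covariance (divisor n)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


def livae_cov_ratio(y, x, g):
    """Ratio Cov(Y, G) / Cov(X, G) with empirical covariances."""
    return cov(y, g) / cov(x, g)


def livae_wald(y, x, g):
    """Ratio of mean differences between G = 1 and G = 0; needs a binary instrument."""
    def diff(v):
        return (mean([vi for vi, gi in zip(v, g) if gi == 1])
                - mean([vi for vi, gi in zip(v, g) if gi == 0]))
    return diff(y) / diff(x)


def livae_two_stage(y, x, g):
    """Two-stage least squares: regress X on G, then Y on the fitted values."""
    slope1 = cov(x, g) / cov(g, g)
    intercept1 = mean(x) - slope1 * mean(g)
    x_hat = [intercept1 + slope1 * gi for gi in g]
    return cov(y, x_hat) / cov(x_hat, x_hat)


# Hypothetical data: binary G, unobserved confounder U, linear effect of X on Y (slope 2).
rng = random.Random(0)
n = 5000
g = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
u = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [gi + ui + rng.gauss(0.0, 1.0) for gi, ui in zip(g, u)]
y = [2.0 * xi + 3.0 * ui + rng.gauss(0.0, 1.0) for xi, ui in zip(x, u)]

estimates = (livae_cov_ratio(y, x, g), livae_wald(y, x, g), livae_two_stage(y, x, g))
```

The three numbers in `estimates` agree up to floating-point error, since the three formulas are algebraically identical for a binary instrument.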

In addition to the three IV conditions and struc- 
tural assumption (5), assume that the conditional 
expectation of Y is linear without interactions and 
that all dependencies only affect the mean. Then 



\[
E(Y \mid X = x, U = u) = E(Y \mid \mathrm{do}(X = x), U = u) = \beta x + h(u), \tag{8}
\]

where $h(u)$ is some function of $u$ only. With $\alpha = E(h(U))$, we have

\[
E(Y \mid \mathrm{do}(X = x)) = \alpha + \beta x, \tag{9}
\]

so that the ACE for a unit difference in $X$ is equal
to the model parameter $\beta$, while the causal relative
risk $\mathrm{CRR}(x_1, x_2)$ under this model is equal to
$(\alpha + \beta x_2)/(\alpha + \beta x_1)$. It can easily be seen (cf. Appendix)
that, under the above assumptions, $\beta = \mathrm{Cov}(Y, G)/\mathrm{Cov}(X, G)$.
Hence, the LIVAE identifies the ACE. In
the Appendix we show that the CRR is also identi- 
fied in this linear model and we will call the corre- 
sponding estimand LIVRR. 
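
The identification of $\beta$ can be sketched in a few lines. This is a brief outline under the stated assumptions (core conditions 1 and 3 together with (8)), not a reproduction of the Appendix; it additionally assumes $\mathrm{Cov}(X, G) \neq 0$, which is slightly stronger than core condition 2.

```latex
\begin{align*}
E(Y \mid X, U, G) &= E(Y \mid X, U) = \beta X + h(U)
  && \text{(core condition 3 and (8))} \\
\mathrm{Cov}(Y, G) &= \mathrm{Cov}\bigl(E(Y \mid X, U, G),\, G\bigr)
  = \beta\, \mathrm{Cov}(X, G) + \mathrm{Cov}\bigl(h(U), G\bigr) \\
 &= \beta\, \mathrm{Cov}(X, G)
  && \text{(core condition 1: } G \perp\!\!\!\perp U\text{)} \\
\beta &= \frac{\mathrm{Cov}(Y, G)}{\mathrm{Cov}(X, G)}
  && \text{(provided } \mathrm{Cov}(X, G) \neq 0\text{)}
\end{align*}
```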

When Y is binary, for example, assumption (8) 
cannot hold exactly, as it allows E(Y\X, U) to take 
values outside [0,1]. It might still be used as a sen- 
sible approximation in practice, especially when the 
range of X is restricted and its effect is small. As 
mentioned above, causal relative risks and odds ra- 
tios that might be of more interest for binary Y can 
also be identified based on the linear model as de- 
tailed in the Appendix. 

Under stronger model assumptions, such as those 
common in the econometrics literature, for instance, 
the LIVAE identifies the individual causal effect, 
ICE. A structural equation model describes how the 
individual responses $Y^i$ depend structurally (i.e.,
under manipulation) on other variables [51, 71]. This
can also be expressed using counterfactuals [8]. A
structural equation counterpart for (9) that parameterizes
the ICE is given by

\[
Y^i(x) = \beta_I x + \zeta^i \tag{10}
\]

for individual $i$, where $\zeta^i$ can be regarded as a
combination of $U^i$ and other (nonconfounding) factors
that determine the outcome. The problem of
confounding by $U$ leads to $\zeta$ and $X$ not being independent,
so that $\beta_I$ cannot be estimated consistently
from a regression of $Y$ on $X$, and the LIVAE is
used instead. For the interpretation it is important 
to note that model (10) explicitly assumes that the 
causal effect is the same for each individual i, while 
(8) assumes that manipulating X has the same av- 
erage effect regardless of the value of U on the linear 
scale. In fact, model (10) implies (8), but the con- 
verse is not true (see the Appendix for details). 

Each of models (8) and (10), together with the 
IV assumptions, allows us to identify the effect of 
exposure on the exposed, that is, the local aver- 
age causal effect LACE, via the LIVAE from (6). 
However, the LACE can be identified under weaker 
model assumptions, namely, those of an additive 
structural mean model (cf. the Appendix or Hernan 
and Robins [32]). Using the notation introduced in
Section 3.1, let X denote the "natural" exposure
level, while the X inside $do(\cdot)$ denotes the exposure
that is set by an intervention (overruling the "natural" X).
An additive SMM assumes that

(11)  $E(Y \mid X = x, G = g) - E(Y \mid X = x, G = g, do(X = 0)) = \beta_L x,$

where X = 0 again denotes a suitable baseline value.
Here, $\beta_L x$ is the effect of reducing the exposure to
this baseline value for those who under "natural"
circumstances are exposed to X = x and have G = g.
Note that this additive SMM makes no explicit
assumptions about individual causal effects or the 
role of U; in fact, U is allowed to modify the effect 
of X on Y. Implicitly, however, the manner in which 
Y depends on U is restricted by the assumption that 
the above difference in conditional expectations (11) 
does not depend on G. The different interpretation 
of the LIVAE in the context of linear models and 
presence of effect modification is also discussed by 
Brookhart and Schneeweiss [8]. 

In summary, we can use the LIVAE to estimate (i) 
the individual causal effect, if we believe that the in- 
dividual effect is the same for everyone on the linear 
scale, or (ii) the average causal effect, if we believe
that the average effect is the same for different values
of U, or (iii) the local effect on the exposed, if
we believe that this is the same for different values 
of G. 

4.2 Nonlinear Wald Type Methods 

As mentioned earlier, the LIVAE is the same as 
Wald's estimator which was originally devised to 
deal with measurement errors [69] . In this section we 
consider two further methods leading to ratio based 
IV estimators and which we will therefore call Wald 
type estimators (cf. also [39, 46, 66]). 

Several applications of Mendelian randomization, 
typically considering a binary outcome Y , a contin- 
uous exposure X and a dichotomous genotype G, 
have used the following reasoning to obtain an IV 
estimator for a causal effect [10, 11, 16, 35, 40]. The 
naive odds ratio of Y given X, which we denote
NOR, is suspected to be confounded. The odds ratio
of Y given the instrument G, which we denote
by OR(Y|G), is not confounded due to core condition 3,
and should be roughly equal to the causal
odds ratio, COR, between X and Y scaled by the
mean difference in exposure for the two genotypes,
$\delta = E(X \mid G = 1) - E(X \mid G = 0)$, that is,
$\mathrm{OR}(Y|G) \approx \mathrm{COR}^{\delta}$. Therefore, in these applications, the quantity
$\mathrm{NOR}^{\delta}$ is compared with OR(Y|G) and, if similar,
the conclusion is drawn that there is no confounding
and, hence, that $\mathrm{NOR} \approx \mathrm{COR}$. We thus consider
the following as the Wald type odds ratio estimand:

$\mathrm{WaldOR} := \mathrm{OR}(Y|G)^{1/\delta}.$

(On the log-scale this is the ratio of log-odds difference
and the mean difference $\delta$, hence "Wald type.")
At first sight, this reasoning seems heuristic, and 
there is no model assumption from which it can 
be theoretically derived. However, by regarding the 
odds ratio as an approximation to the relative risk 
for rare diseases, we can motivate the above formula 
theoretically. The following is a slight generaliza- 
tion of the structural equation approach presented 
by Mullahy [47] and suitable not only for binary 
but also for general nonnegative response Y (X and 
G can be continuous or discrete). Assuming a log- 
linear model [and structural assumption (5)], 

(12)  $\log E(Y \mid X = x, U = u) = \log E(Y \mid do(X = x), U = u) = \gamma x + h(u),$

where h(u) is some function of u only. It can then
easily be seen that the causal relative risk for one
unit difference in X is simply $\mathrm{CRR} = \exp(\gamma)$. Further,
we suppose that X has conditional mean

(13)  $E(X \mid G = g, U = u) = \delta g + k(u),$

where k(u) is some function of u only, and, in addition,
we require that the distribution of X is such
that

(14)  $[X - (\delta G + k(U))] \perp\!\!\!\perp G \mid U.$



Note that this requirement cannot be satisfied when
X is binary, for instance, but is automatically true
when it has a conditional normal distribution. It can
now be shown (cf. the Appendix or Mullahy [47])
that the CRR is identified because $\gamma$ is equal to the
ratio of the log-coefficient from a loglinear regression
of Y on G and the coefficient from a linear regression
of X on G. In the special case of a binary instrument
G, this simplifies to

$\gamma = \dfrac{\log E(Y \mid G = 1) - \log E(Y \mid G = 0)}{E(X \mid G = 1) - E(X \mid G = 0)}.$



The method of estimating this via two regressions 
as mentioned above is also called two-stage quasi 
maximum likelihood [47]. We will refer to the esti- 
mand based on the right-hand side of the above as 
the Wald relative risk (WaldRR), given by 



$\mathrm{WaldRR} := \mathrm{RR}(Y|G)^{1/\delta},$

where RR(Y|G) is shorthand for the relative risk of
Y given G. When G is binary, $\delta$ is the mean difference
in X, otherwise it is $\mathrm{Cov}(X, G)/\mathrm{Var}(G)$.
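
As a small illustration of the two Wald type estimands, the sketch below computes WaldRR and WaldOR for a binary instrument from three summary quantities. The function names and the numerical inputs are hypothetical and chosen only for illustration; they are not taken from any study cited here.

```python
import math


def wald_rr(p_y_g1, p_y_g0, delta):
    """WaldRR = RR(Y|G) ** (1 / delta) for a binary instrument G."""
    return (p_y_g1 / p_y_g0) ** (1.0 / delta)


def wald_or(p_y_g1, p_y_g0, delta):
    """WaldOR = OR(Y|G) ** (1 / delta) for a binary instrument G."""
    odds_g1 = p_y_g1 / (1.0 - p_y_g1)
    odds_g0 = p_y_g0 / (1.0 - p_y_g0)
    return (odds_g1 / odds_g0) ** (1.0 / delta)


# Hypothetical inputs: P(Y=1|G=1), P(Y=1|G=0), and the mean
# exposure difference delta = E(X|G=1) - E(X|G=0).
p1, p0, delta = 0.036, 0.030, 0.5

rr_estimand = wald_rr(p1, p0, delta)   # (0.036 / 0.030) ** 2 = 1.44
or_estimand = wald_or(p1, p0, delta)

# On the log scale WaldRR is a ratio of two differences ("Wald type").
log_ratio = (math.log(p1) - math.log(p0)) / delta
print(rr_estimand, or_estimand, math.exp(log_ratio))
```

Only the association of Y with G and the mean difference of X between genotypes enter these formulas, so the two ingredients could in principle come from separate data sources.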

The WaldRR identifies the CRR under the above 
combination of log-linear model for Y given X and 
U, and the stated assumptions on the conditional 
distribution of X given G and U. Note that when
Y is nonnegative and continuous, it is, in principle,
possible (but not common) to elaborate the assumptions
further so that the individual relative causal effect
$Y^i(x_2)/Y^i(x_1)$ is identifiable; model (12) would
then need to be reformulated as a structural equation
model analogously to the linear case earlier.

When Y is binary and P(Y = 1) is small ("rare 
disease assumption"), WaldRR and WaldOR will be 
approximately the same, so that in this case we can 
argue that the WaldOR approximately identifies the 
COR under the same model assumptions. A differ- 
ent justification of WaldOR has been proposed by 
[3] based on a logistic SMM and some very rough ap- 
proximations, under which it identifies the LCOR. 



4.3 Multiplicative Structural Mean Models 

We already mentioned that the LIVAE can be 
justified in an additive SMM identifying the causal 
mean difference within the exposed individuals (LACE). 
Alternatively, we now consider a multiplicative structural
mean model (MSMM) [32]. Again using the
notation introduced in Section 3.1, let X denote the
"natural" exposure level, while the X inside $do(\cdot)$
denotes the exposure that is set by an intervention
(overruling the "natural" X). An MSMM parameterizes the LCRR
(2) and is given by

(15)  $\log\left\{\dfrac{E(Y \mid X = x, G, do(X = x))}{E(Y \mid X = x, G, do(X = 0))}\right\} = \gamma_L x,$

where X = 0 stands for a suitable baseline value as
before. Hence, $\gamma_L x$ is the log-relative risk of changing
the exposure to this baseline for those who would
normally be exposed to X = x, where it is assumed
that the effect is the same within different levels of
the instrument G. This does not follow from the
core IV conditions nor from the structural assump- 
tion (5). It means, for example, that reducing the 
alcohol intake for those individuals who are heavy 
drinkers has the same effect on the relative risk for 
hypertension regardless of their ALDH2 genotype. 
This may be unrealistic if those who drink much 
despite carrying the ALDH2*2 variant are different 
in relevant aspects from those who drink much and 
do not carry this allele. An analogous assumption
is made by the additive SMM (11) but for the risk
difference; note that, except in trivial cases, the
assumptions that G does not modify the effect
on the multiplicative scale and on the additive scale
cannot both be true [32]. This assumption
of no heterogeneity with respect to levels of G is 
required so that the model has only one unknown 
parameter, since we can only identify one parame- 
ter. When baseline covariates have been measured, 
it is possible to identify more complex SMMs and 
this assumption could be relaxed [3, 27, 32], but we 
do not consider this any further here. 

In general, a SMM estimator for a causal parame- 
ter is obtained by solving estimating equations that 
are based on the exclusion restriction mentioned in 
Section 3.2. The solution typically does not have a 
closed form expression. However, for the case where 
X and G are binary, an explicit solution exists [32,
57] (cf. also the Appendix), yielding that $\exp(-\gamma_L)$
equals

(16)  $1 - \dfrac{E(Y \mid G = 1) - E(Y \mid G = 0)}{E(YX \mid G = 1) - E(YX \mid G = 0)}.$

The parameter $\gamma_L$ can easily be estimated using the
corresponding empirical frequencies or averages. Under
the multiplicative SMM, we hence obtain that
the estimand is the inverse of (16), which we will
call MSMMRR. It identifies the LCRR under the IV
core conditions and the assumptions of an MSMM.
In order for it to also identify the population effect 
CRR, it is sufficient to assume that X and U cannot 
interact on Y on the multiplicative scale [32]. This 
is analogous to the "no interaction" assumption in 
linear model (8). In this special case we can also 
obtain closed formulae for the odds ratio and risk 
difference [32, 57] (cf. also the Appendix). 
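
The closed form for binary X and G can be sketched in a few lines. The example below evaluates the inverse of expression (16), read as $1 - \{E(Y \mid G = 1) - E(Y \mid G = 0)\}/\{E(YX \mid G = 1) - E(YX \mid G = 0)\}$, on a hypothetical unconfounded toy distribution in which the true relative risk is known. The function name and all numbers are illustrative assumptions.

```python
def msmm_rr(e_y_g1, e_y_g0, e_yx_g1, e_yx_g0):
    """MSMMRR for binary X and G: the inverse of expression (16)."""
    exp_minus_gamma = 1.0 - (e_y_g1 - e_y_g0) / (e_yx_g1 - e_yx_g0)
    return 1.0 / exp_minus_gamma


# Hypothetical toy distribution without confounding:
# P(Y=1 | X=x) = p0 * r**x and P(X=1 | G=g) = pi[g].
p0, r = 0.02, 2.5
pi = {0: 0.10, 1: 0.24}

# E(Y | G=g) and E(YX | G=g) implied by the toy distribution.
e_y = {g: p0 * (1 - pi[g]) + p0 * r * pi[g] for g in (0, 1)}
e_yx = {g: p0 * r * pi[g] for g in (0, 1)}

estimand = msmm_rr(e_y[1], e_y[0], e_yx[1], e_yx[0])
print(estimand)  # recovers the relative risk r of the toy model
```

Note that the term E(YX|G) requires joint information on Y, X and G, in contrast to the ratio type estimands.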

Logistic structural mean models have been pro- 
posed [67], but these require more restrictive as- 
sumptions, and conditions enabling consistent es- 
timation are relatively complicated. We therefore 
omit them here. Robins and Rotnitzky [59] provide 
a detailed discussion of the fundamental difficulty 
with identifiability in SMMs, other than for the ad- 
ditive or multiplicative cases. 

4.4 Comparison of Assumptions 

The estimands, their target causal effects and the 
conditions for identification are summarized in Ta- 
ble 1. (WaldOR as an approximation to WaldRR is 
omitted.) The following points are noteworthy: 

• One could say that the strongest assumptions are
those underlying the WaldRR and WaldOR, as
they rely on a specific outcome model for the
distribution of Y given (X, U), as well as a specific
exposure model for the distribution of X given
(G, U). Neither the linear models nor the SMMs
require the latter.

• All IV approaches underlying point-estimation rely
on some "no-interaction" (or homogeneity/no effect
modification) assumption. No interaction between
X and U on the linear (or log-linear) scale
in the sense of model (8) [or model (12)] is sufficient
to ensure the assumption of no interaction
between X and G in the additive (or multiplicative)
SMMs, models (11) and (15) (see the
Appendix). However, the "no-interaction" assumption
may either be true on the linear or on the log-linear
scale, but not both, except in trivial cases
like $Y \perp\!\!\!\perp X \mid U$ or $Y \perp\!\!\!\perp U \mid X$ [32].

• In contrast to the MSMM, the linear and Wald-type
models do not require joint information on
(X, Y, G); they allow identification of the causal
parameter based on separate information on the
joint distribution of (X, G) and of (Y, G) only.
This means that an IV analysis can be performed
by exploiting results, for example, from different
existing genetic studies or meta-analyses, as is particularly
relevant for Mendelian randomization applications
[46, 66]. In addition, the WaldOR is useful
for case-control studies where, under the rare
disease assumption, $\delta$ can be approximated by a
control group estimate [35].

5. NUMERICAL ILLUSTRATION OF 
ASYMPTOTIC BIAS 

In the previous section we have given some exam- 
ples of standard models that allow point-identification 
of a causal parameter exploiting an IV. In practice, 
such model assumptions are unlikely ever to hold ex- 
actly, and we should be concerned with the robust- 
ness of IV methods under violations of such assump- 
tions. Therefore, in this section we illustrate the pos- 
sible bias of the above approaches for a set of con- 
crete scenarios that would be realistic, for instance, 
in a Mendelian randomization study. We place im- 
portance on the following issues: 

• A sensible IV model should allow consistent estimation
at the null hypothesis of no causal effect.

• A sensible IV model should also allow consistent, 
or at least not seriously biased, estimation when 
there is in fact no confounding, and hence a "naive" 
analysis, based on a regression of response Y on 
exposure X without using an IV, would be valid. 

• A sensible IV model should also not induce more 
bias than such a naive approach. 

We want to investigate which of the various IV meth- 
ods satisfy these desiderata, or what situations lead 
to the most serious violations. 

Table 1

Summary of IV model assumptions under which the various estimands identify the targeted causal effects
(in addition to general IV assumptions)

Estimand   Target   Model assumptions
LIVAE      ICE      Constant additive individual effect [$Y^i(x)$ linear in x].
LIVAE      ACE      $E(Y \mid X = x, U = u)$ linear in x, no (X, U)-interaction on additive scale.
LIVAE      LACE     $E(Y \mid X = x, G = g) - E(Y \mid X = x, G = g, do(X = 0))$ linear in x, no (X, G)-interaction on additive scale.
LIVRR      CRR      Same as LIVAE for ACE.
WaldRR     CRR      (i) $E(X \mid G = g, U = u)$ linear in g, no (G, U)-interaction on additive scale, additive independent residual; (ii) $E(Y \mid X = x, U = u)$ log-linear in x, no (X, U)-interaction on multiplicative scale.
MSMMRR     LCRR     $\log\{E(Y \mid X = x, G = g)/E(Y \mid X = x, G = g, do(X = 0))\}$ linear in x, no (X, G)-interaction on multiplicative scale.
MSMMRR     CRR      As for LCRR, and no (X, U)-interaction on multiplicative scale.

Using the notation introduced at the beginning of
Section 4, we base our comparison on the difference
between the targeted causal parameter $\theta^*$ and the
estimand $\theta_{\mathcal{M}}$ under a given model $\mathcal{M}$, evaluated at
the true distribution $P^*$. More precisely, we use the
relative measure

$\dfrac{\theta_{\mathcal{M}} - \theta^*}{\theta^*},$

which is the asymptotic relative bias of any consistent
estimator $\hat{\theta}_{\mathcal{M}}$ for $\theta_{\mathcal{M}}$. If the model is correctly
specified, that is, $P^* \in \mathcal{M}$, and identifies the causal
parameter, then the above is zero. The asymptotic
relative bias can be calculated exactly, using numerical
integration where required, under a given choice
of a "true" joint distribution $P^*$ of (X, Y, G, U) (see
below). In special cases it is even possible to express
the bias explicitly as in [8] for the linear case. Note
that we are not considering any sampling properties
of specific estimators $\hat{\theta}_{\mathcal{M}}$ and hence are not simulating
any data.
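
To help read the relative biases reported later, the relative measure can be written as a one-line function together with its inverse. The sketch below uses hypothetical function names; the only numbers are a relative bias of 2.510 at CRR = 3.03, as reported for the WaldRR in one row of Table 4.

```python
def relative_bias(theta_m, theta_star):
    """Asymptotic relative bias (theta_M - theta*) / theta*."""
    return (theta_m - theta_star) / theta_star


def estimand_from_bias(rel_bias, theta_star):
    """Invert the relative bias to recover the estimand theta_M."""
    return theta_star * (1.0 + rel_bias)


# A relative bias of 2.510 when the true CRR is 3.03 means the estimand
# is about 3.5 times the true value.
wald_estimand = estimand_from_bias(2.510, 3.03)
print(round(wald_estimand, 2))
```

So a relative bias of 250% does not mean the estimand is 2.5 times the truth, but rather the truth plus 2.5 times the truth.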

We restrict our numerical comparison to the causal
relative risk, $\theta^* = \mathrm{CRR}$, as target. We compare the
linear model, with estimand LIVRR, the log-linear
Wald type approach, with estimand WaldRR (WaldOR
is always slightly more biased for CRR than
WaldRR and is therefore omitted), and the multiplicative
SMM, with estimand MSMMRR, which all
identify the CRR under their respective assumptions
as detailed in Section 4.

5.1 Full Model 

The true joint distributions P* for (X,Y,G,U) 
that we use for the comparison are specified as fol- 
lows. To facilitate interpretation and to keep the 
number of parametric and distributional choices lim- 
ited, we consider dichotomous observable variables 
Y, X and G with the following interpretations: 

$Y = \begin{cases} 1, & \text{diseased}, \\ 0, & \text{healthy}, \end{cases} \qquad X = \begin{cases} 1, & \text{exposed}, \\ 0, & \text{not exposed}, \end{cases}$

and we label G = 1 to denote the value of the in- 
strument that predisposes to X = 1. 

The dependence of Y on X and U is given by a
logistic regression. In addition, we assume that this
model is invariant with respect to intervention on
X, by which we mean

(17)  $\operatorname{logit} E(Y \mid X = x, U = u) = \operatorname{logit} E(Y \mid do(X = x), U = u) = \alpha_1 + \alpha_2 x + \alpha_3 u + \alpha_4 x u.$

The conditional distribution of X given G and U is
also determined by a logistic dependence:

(18)  $\operatorname{logit} E(X \mid G = g, U = u) = \beta_1 + \beta_2 g + \beta_3 u + \beta_4 g u.$

Finally, the marginal distribution of G is determined
by $p_g = P(G = 1)$, which we set to 50% throughout
(all estimands are unaffected by $p_g$), while p(u)
is continuous and set to have a uniform distribution
on [0,1].

The true CRR can easily be calculated from the
above using (5) and integrating out as

(19)  $\dfrac{\int \{1 + \exp(-\alpha_1 - \alpha_2 - \alpha_3 u - \alpha_4 u)\}^{-1} p(u)\, du}{\int \{1 + \exp(-\alpha_1 - \alpha_3 u)\}^{-1} p(u)\, du}.$

Note that the CRR does not depend on (18), but
$\theta_{\mathcal{M}}$ does for the IV models considered here.
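
The ratio of integrals in (19) is straightforward to evaluate numerically. The following sketch assumes U is uniform on [0,1], as stated above, and uses a simple midpoint rule; the function names, grid size and parameter values are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))


def causal_rr(a1, a2, a3, a4, n_grid=20_000):
    """Ratio of integrals (19) for U ~ Uniform[0, 1] via a midpoint rule."""
    u = (np.arange(n_grid) + 0.5) / n_grid          # midpoints of [0, 1]
    risk_exposed = expit(a1 + a2 + (a3 + a4) * u)   # P(Y=1 | do(X=1), U=u)
    risk_unexposed = expit(a1 + a3 * u)             # P(Y=1 | do(X=0), U=u)
    # p(u) = 1 on [0, 1], so each integral is the average over the grid.
    return risk_exposed.mean() / risk_unexposed.mean()


# No effect in any subgroup (a2 = a4 = 0) gives CRR = 1 whatever a3 is.
crr_null = causal_rr(-3.5, 0.0, 1.0, 0.0)

# Without any dependence on U, the CRR reduces to a ratio of two risks.
crr_simple = causal_rr(-3.5, 0.8, 0.0, 0.0)
print(crr_null, crr_simple)
```

Because only the coefficients of (17) enter, this calculation does not involve the exposure model (18).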

For the above true distributions $P^*$, all models
from Section 4 are essentially misspecified, since none
of them model a logistic dependence of Y on (X, U).
Exceptions are $\alpha_2 = \alpha_4 = 0$, or for the linear and
MSMM when $\alpha_3 = \alpha_4 = 0$. Also, note that if $\alpha_4 = 0$,
then there is no effect modification by U on the logistic
scale. This does not strictly imply no effect
modification on the additive or multiplicative scales,
though departure from these assumptions will be
more extreme when $\alpha_4 \neq 0$.

Our choice of $P^*$ is motivated by the fact that
a logistic model like (17) would be the standard
model assumption for a binary outcome if the confounder(s)
U could be observed. It is noteworthy that
this default model assumption for the case of ob- 
served confounding is not necessarily compatible with 
standard IV methods for unobserved confounding. 

5.1.1 Settings of the parameters. There are eight
parameters in (17) and (18). By varying these, we 
consider the following set of scenarios which we re- 
gard as realistic for epidemiological studies based on 
Mendelian randomization, for example. 

We choose three strengths for the causal effect:
none (CRR = 1.0), small (CRR = 1.33) and large
(CRR = 3.03); this is obtained by adjusting $\alpha_2$ accordingly.
Confounding is varied by setting $\alpha_3 \in \{0, 0.1, 1, 2\}$,
while keeping $\beta_3 = 2$ fixed. Interactions are
investigated by varying $\beta_4, \alpha_4 \in \{-1, 0, 1\}$, but note
that we only consider combinations where $|\alpha_4| \le |\alpha_3|$,
as large interactions with small main effects
are commonly perceived as unrealistic. The remaining
parameters are chosen so as to satisfy the following
criteria. The strength of the association between
G and X is kept constant at a relative risk of
2.4 throughout by adjusting $\beta_2$ accordingly. We fix
the marginals P(X = 1) = 0.13 and P(Y = 1) = 0.03
by setting $\beta_1$ and $\alpha_1$ accordingly. These latter values,
respectively, are again typical for the exposure
frequencies and rare disease situations, as are often
encountered in Mendelian randomization studies.
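
The grid of confounded scenarios described above can be enumerated directly. The sketch below is an illustration under two stated assumptions: the constraint on the interaction is read as non-strict (the absolute value of $\alpha_4$ may equal that of $\alpha_3$), since the tables contain rows with $\alpha_3 = 1$ and $\alpha_4 = \pm 1$, and the value $\alpha_3 = 0$ is left out because it corresponds to no confounding, which is treated separately. Variable names are hypothetical.

```python
from itertools import product

alpha3_values = [0.1, 1, 2]       # confounding strengths (alpha3 = 0 left out)
interaction_values = [-1, 0, 1]   # candidate values for alpha4 and beta4

# Keep only combinations where the X-U interaction is no larger in
# absolute value than the main effect of U.
scenarios = [
    (a3, a4, b4)
    for a3, a4, b4 in product(alpha3_values, interaction_values, interaction_values)
    if abs(a4) <= abs(a3)
]

# Subset with a nonzero X-U interaction.
with_interaction = [s for s in scenarios if s[1] != 0]

print(len(scenarios), len(with_interaction))
```

Under these assumptions the enumeration gives 21 confounded settings per effect size, of which 12 have a nonzero X-U interaction, which agrees with the number of rows in Tables 4 and 2, respectively.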

5.1.2 Bounds. To further characterize the chosen
scenarios, we calculated the nonparametric bounds 
for the CRR (and the ACE for comparison) [4, 22, 
44, 57] for all our settings and found that they were 
always extremely wide and always included the null 
hypothesis of no effect. For those settings where 
CRR = 3.03, for instance, the bounds were of the 
order [0.2, 30] (and about [-0.08, 0.8] for the ACE 
where the true ACE was around 0.06). These are 
the "tightest assumption-free bounds" [4], meaning 
that the observable frequencies p(y,x,g) alone, de- 
rived from the above distributions by marginalizing 
over U, do not allow us to narrow down the causal 
effects any further. This re-emphasizes the fact that 
point-identification via an IV model relies heavily 
on the additional parametric assumptions that have 
to be made. Narrower bounds can be obtained when
a stronger instrument is used, that is, by increasing
the G-X association. However, the relative risk of 
2.4 used here is about as strong as we would expect 
to see in a Mendelian randomization study. 

5.2 Numerical Results 

We now compare the asymptotic biases of the 
LIVRR, WaldRR and MSMMRR. In addition, we 
consider the naive relative risk, NRR, obtained as 
$P^*(Y = 1 \mid X = 1)/P^*(Y = 1 \mid X = 0)$, which gives an
indication of the bias of a standard analysis when 
not using an IV. In our settings, the NRR is unbi- 
ased when there is no confounding, but not neces- 
sarily otherwise. 
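
The observable distribution p(y, x, g) implied by models (17) and (18), and from it the NRR, can be obtained by integrating out U numerically. The sketch below assumes U is uniform on [0,1] and uses a midpoint rule; the function names and the parameter values are hypothetical and are not calibrated to the marginals used in the paper.

```python
import numpy as np


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))


def joint_pyxg(alpha, beta, p_g=0.5, n_grid=20_000):
    """Array p[y, x, g] implied by (17)-(18) with U ~ Uniform[0, 1]."""
    a1, a2, a3, a4 = alpha
    b1, b2, b3, b4 = beta
    u = (np.arange(n_grid) + 0.5) / n_grid   # midpoint grid on [0, 1]
    p = np.zeros((2, 2, 2))
    for g in (0, 1):
        prob_g = p_g if g == 1 else 1.0 - p_g
        px1 = expit(b1 + b2 * g + b3 * u + b4 * g * u)   # P(X=1 | G=g, U=u)
        for x in (0, 1):
            px = px1 if x == 1 else 1.0 - px1
            py1 = expit(a1 + a2 * x + a3 * u + a4 * x * u)  # P(Y=1 | X=x, U=u)
            p[1, x, g] = prob_g * np.mean(px * py1)
            p[0, x, g] = prob_g * np.mean(px * (1.0 - py1))
    return p


def naive_rr(p):
    """NRR = P(Y=1 | X=1) / P(Y=1 | X=0), marginalizing over G."""
    pyx = p.sum(axis=2)   # p[y, x]
    return (pyx[1, 1] / pyx[:, 1].sum()) / (pyx[1, 0] / pyx[:, 0].sum())


beta = (-2.5, 1.0, 2.0, 0.0)
p_unconfounded = joint_pyxg((-3.5, 0.8, 0.0, 0.0), beta)   # alpha3 = alpha4 = 0
p_confounded = joint_pyxg((-3.5, 0.0, 2.0, 0.0), beta)     # no effect, but confounded

nrr_unconfounded = naive_rr(p_unconfounded)
nrr_confounded = naive_rr(p_confounded)
print(nrr_unconfounded, nrr_confounded)
```

In the unconfounded case the NRR equals the ratio of the two risks in model (17); in the confounded case it exceeds one although exposure has no effect in any subgroup.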

5.2.1 No causal effect. We begin with the case where
CRR = 1, which usually constitutes the null hypothesis.
When $\alpha_4 = \alpha_2 = 0$, no table is shown as none
of the IV models from Section 4 are misspecified;
only the NRR is biased, by as much as 39%. However,
CRR = 1 can also arise when $\alpha_2$ and $\alpha_4$ are
nonzero and of opposite signs. The relative biases
for the corresponding settings are shown in Table 2.

The problem we mentioned earlier, and that becomes
evident here, is that there can be two types of
scenarios where CRR = 1: either there is no causal
effect of exposure in any subgroup ($\alpha_2 = \alpha_4 = 0$), or
there are different causal effects in subgroups which
cancel out overall. The latter occurs when $\alpha_2$ and $\alpha_4$
are nonzero in such a way that the ratio of integrals
in (19) happens to be one.

Table 2

Asymptotic relative biases when estimating CRR for all
settings with CRR = 1 and $\alpha_4 \neq 0$

$\alpha_3$   $\alpha_4$   $\beta_4$   NRR     LIVRR    WaldRR   MSMM
1     1     0    0.277    0.105    0.110    0.095
2     1     0    0.414    0.092    0.095    0.075
1    -1     0    0.020   -0.113   -0.108   -0.101
2    -1     0    0.174   -0.106   -0.102   -0.087
1     1     1    0.361    0.198    0.213    0.163
2     1     1    0.545    0.177    0.189    0.125
1    -1     1    0.025   -0.202   -0.187   -0.169
2    -1     1    0.226   -0.195   -0.181   -0.140
1     1    -1    0.184    0.006    0.006    0.006
2     1    -1    0.272    0.002    0.002    0.002
1    -1    -1    0.013   -0.009   -0.009   -0.009
2    -1    -1    0.115   -0.006   -0.006   -0.005

All IV methods exhibit some bias in these scenarios,
with around 20% relative bias in the worst
case. We can see the following patterns in Table 2.
When $\beta_4 = -1$, all IV estimators are only slightly
biased, while the NRR can be biased by up to 27%.
There are only two settings where all IV methods
are more biased than the naive one, and these are
when $\alpha_4 = -1$ and $\alpha_3 = 1$, and $\beta_4 = 0$ or 1. For
all considered settings, the MSMMRR is the least
biased, and the WaldRR is the most biased, but
the order of magnitude is generally comparable and
we would not suggest an overall ranking of the approaches
based on these results alone.

Recall that the MSMMRR does not actually tar- 
get the CRR, but targets a particular subgroup effect— 
the local causal relative risk of exposure within the 
exposed — instead. The latter is typically not one when 

5.2.2 Causal effect but no confounding. Let us now
consider those scenarios where there is no confounding
(so either $\alpha_3 = \alpha_4 = 0$ or $\beta_3 = \beta_4 = 0$). No plots
or tables are shown here as only the WaldRR has 
nonzero bias. This is because all assumptions of the 
naive, linear and multiplicative structural mean mod- 
els are satisfied when there is no confounding and 
when X and Y are binary. In contrast, as noted in 
Section 4.2 and again in the Appendix, the assump- 
tion (14) underlying the WaldRR cannot be satis- 
fied when X is binary. We observed biases for the 
WaldRR and WaldOR of up to 3.2% and 4.5%, re- 
spectively, for a moderate effect size of CRR = 1.33, 
and biases as large as 65% and 76%, respectively, 
when CRR = 3.03. 

5.2.3 Causal effect and confounding. We now consider
those scenarios where there is a causal effect
as well as confounding. Tables 3 and 4 show the re- 
sults for a small causal effect (CRR= 1.33) and a 
large causal effect (CRR = 3.03), respectively. 

First, let us compare the results for small versus 
large CRR. The naive relative risk (NRR) behaves 
similarly in both cases. The LIVRR is more biased 
when the true causal effect is large — this is plausible 
as the nonlinearity of the model is more pronounced 
for larger causal effects. The WaldRR is unaccept- 
able when CRR = 3.03: with relative biases between 
40% and 250%, it seriously overestimates the true 
effect. As its bias is either comparable to, or much 
larger than, the bias for the other two IV methods 
when CRR = 1.33, we will not consider the WaldRR 
any further. The relative bias of the MSMMRR, in
turn, is similar for small and large CRR with a maximum
of 17%.

Table 3

Asymptotic relative biases when estimating CRR for all
settings with CRR = 1.33

As one might expect, the LIVRR and MSMMRR
are only slightly biased, and much less so than the
NRR, whenever there is no X-U interaction, $\alpha_4 = 0$.
More surprising is that this is also the case when
$\beta_4 = -1$ regardless of the other parameter values.
This is not due to less confounding, as we can see
that the naive relative risk is still noticeably biased
in those settings.

All methods struggle the most when $\alpha_4 \neq 0$ and
$\beta_4 = 1$: the MSMMRR bias then reaches 17% and
the extent of the LIVRR bias can range from 24%
for small CRR to 45% for large CRR.

Even though there is no uniformly best method,
both tables show that the MSMMRR is much less
biased in most settings. The only cases where it is
outperformed by the LIVRR arise when $\alpha_4 = -1$.
The only cases where it is outperformed by the NRR
are when additionally $\alpha_3 = 1$.

5.2.4 Sign of bias. Due to our choices of the coefficients
of U, the NRR is always positively biased. The
IV estimators can, however, be negatively biased, especially
when $\alpha_4$ or $\beta_4$ are negative. Also, their bias
does not always have the same sign. Therefore, we
cannot say that IV methods generally over- or underestimate
the true causal effect.

Table 4

Asymptotic relative biases when estimating CRR for all
settings with CRR = 3.03

$\alpha_3$   $\alpha_4$   $\beta_4$   NRR     LIVRR    WaldRR   MSMM
0.1    0     0    0.014    0.006    0.671   -0.001
1.0    0     0    0.145    0.066    0.870   -0.006
2.0    0     0    0.289    0.128    1.090   -0.010
1.0    1     0    0.265    0.161    1.220    0.084
2.0    1     0    0.397    0.210    1.410    0.061
1.0   -1     0    0.020   -0.036    0.539   -0.102
2.0   -1     0    0.167    0.033    0.757   -0.093
0.1    0     1    0.018    0.013    0.695   -0.001
1.0    0     1    0.188    0.132    1.110   -0.010
2.0    0     1    0.379    0.263    1.630   -0.017
1.0    1     1    0.344    0.334    1.950    0.144
2.0    1     1    0.523    0.447    2.510    0.102
1.0   -1     1    0.025   -0.070    0.440   -0.170
2.0   -1     1    0.217    0.062    0.858   -0.150
0.1    0    -1    0.009   -0.001    0.647   -0.000
1.0    0    -1    0.096   -0.006    0.637   -0.000
2.0    0    -1    0.191   -0.014    0.605   -0.000
1.0    1    -1    0.176   -0.019    0.590    0.005
2.0    1    -1    0.261   -0.028    0.570    0.003
1.0   -1    -1    0.013    0.004    0.663   -0.009
2.0   -1    -1    0.110   -0.002    0.648   -0.006

5.2.5 Other comparisons. We also considered the
other causal parameters, ACE and COR, as targets 
in our chosen scenarios using the corresponding es- 
timands under the three IV models. We got broadly 
similar results with the SMM approach generally 
producing less biased results, except in the pres- 
ence of interactions, and the Wald approach behav- 
ing very poorly throughout even when there is little 
or no confounding. 

All results presented so far were for scenarios with 
3% disease frequency and 13% exposure frequency. 
We also considered scenarios with 20% disease and/or 
50% or 85% exposure frequencies, but do not report 
them in detail as the results followed similar patterns
in terms of relative performances of the various
approaches. All IV methods show much less
bias with 50% exposure frequency, with the WaldRR
performing much more sensibly, in particular.
The MSMM is still clearly the least biased and is 
not sensitive to interaction effects when the expo- 
sure frequency is 50%. This might be due to the 
exposure distribution being more balanced, so that 
conditioning on X is not so informative for U and, 
hence, the local causal effect is not much different 
from the population causal effect even when there 
are strong interactions. 

5.3 Practical Implications 

In Section 4.4 we compared the assumptions un- 
derlying the IV models of Section 4 on theoretical 
grounds. The above numerical study adds the fol- 
lowing insights: 

• The linear IV approach is often not considered 
appropriate when the outcome variable is binary 
or nonnegative. However, we found that it per- 
formed better than expected for binary Y with 
relative asymptotic bias below 20% in all but six 
of the considered scenarios and with less bias than 
that of the naive approach in all but five scenarios. 
This may be deemed acceptable, especially given 
the simplicity of the linear IV estimator. However, 
for the linearity assumption to be at least approx- 
imately appropriate with binary outcomes, the 
range of exposure X should be restricted and the 
true causal effect small. The latter is not uncom- 
mon for epidemiological — especially Mendelian 
randomization — applications. 

• Although it is clear by theory alone that the Wald
type methods from Section 4.2 make very strong
assumptions, we have seen here that they can be
not just slightly but extremely biased when
these assumptions are violated. It is especially
worrisome that this occurs for realistic scenarios, 
that the bias can be worse than with the naive ap- 
proach and increases with the strength of the true 
causal effect, and that they can be biased even 
when there is no confounding since the model for 
the exposure X is violated. We would therefore 
not recommend this approach unless there is good 
reason to be confident in the model assumptions. 
A small true causal effect and a balanced or ap- 
proximately normal distribution of the exposure 
X, possibly after suitable transformation, would 
support this confidence. 

• As mentioned before, all IV approaches, excluding
the bounds, make an assumption of no-interaction
or no effect modification by the unobserved confounder
U either on the additive or multiplicative
scale. The results show that violation of this assumption
indeed seriously increases the bias of all
IV methods and can lead to bias even at the null
hypothesis of no causal effect. In practice, this
assumption is difficult to assess or justify as it involves
the unobserved confounders which might
include factors that are poorly understood.

• As far as the relative bias is concerned, the MSMM 
approach seems the most recommendable for sit- 
uations similar to those of Section 5.1, especially 
for binary outcomes. However, other properties 
are relevant for practical application, most impor- 
tant being the efficiency of the estimators. As our 
numerical study only considers a specific set of 
scenarios, it is also not possible to say whether 
the MSMM performs equally well in very differ- 
ent situations. We therefore recommend that fur- 
ther comparison and sensitivity analyses are car- 
ried out for any specific application. 

6. CONCLUSION AND DISCUSSION 

Our theoretical comparison of different IV meth- 
ods was motivated by the need for such methods 
in observational epidemiology, with Mendelian ran- 
domization applications providing an example that 
has generated a lot of recent interest. The core con- 
ditions 1-3 plus the structural assumption (5) are 
sufficient for testing for a causal effect of exposure 
on disease, but, as emphasized here, the identifica- 
tion of a causal effect has to rely on additional model 
assumptions which, if inappropriate, can induce bias 
as illustrated in our numerical study. The need for a 
comparison of IV methods is also highlighted by the 
results of a recent study which concluded that there 
were very few differences between IV approaches be- 
cause they yielded similar results on particular data 
sets [54, 56]. Our results do not support this point 
of view and show that any model assumptions have 
to be justified carefully. 

The main points to be made from our comparison 
are that the different IV approaches target different 
parameters, where we are not referring to the dif- 
ference between a risk difference and risk ratio, for 
instance, but the difference between an individual, 
population or local causal effect. In the case of the 
latter, the SMM approach (additive or multiplicative)
makes the weakest assumptions, as it does not
require a model for the exposure X given the instrument
G, and it only assumes (log-)linearity of the effect
within the exposed individuals. Under stronger
assumptions, essentially if U and X do not interact 
on Y on the relevant scale, the local causal effect 
is equal to the population causal effect. However, 
the multiplicative SMM requires joint data on the 
observable variables which may not always be avail- 
able from existing studies. For the linear model it 
has also been noted by [8] that the traditional ratio 
estimator LIVAE has to be given a different inter- 
pretation in the presence of effect modification. The 
Wald type estimator for the relative risk, together 
with the odds ratio as an approximation to the lat- 
ter, is simple and useful for meta-analyses but makes 
very specific assumptions about all conditional dis- 
tributions, especially that of the exposure, and also 
requires the absence of interactions on the multi- 
plicative scale. 

Our bias calculations are of course only valid for 
the particular model and scenarios we chose to con- 
sider, but we believe they still raise serious issues. 
Not surprisingly, all estimators encounter difficul- 
ties in estimating the population effect in scenar- 
ios where the exposure has different effects within 
levels of the unobserved confounder. Maybe more 
surprising are the particularly poor performances of 
the Wald relative risk and odds ratio — especially in 
the absence of confounding. This is supported by 
a recent study on odds ratio estimators which also 
found that the WaldOR was often outperformed by 
other approaches [3]. However, we did not find that 
it did "especially well" at the causal null hypothe- 
sis, as reported there, when there were interactions 
in the model for the outcome Y. An obvious im- 
plication for practical applications of IV methods 
is that the plausibility of such interactions, on the 
chosen effect scale, should be explicitly addressed. 
If such interactions are judged to be likely on the 
multiplicative scale, then the MSMM estimator is 
closer to the local effect and the Wald relative risk 
is likely to be seriously biased. Also, one has to keep 
in mind that such interactions can induce bias of all 
IV methods even at the null hypothesis of no causal 
effect, though one might hope that such exact can- 
cellations of subgroup effects are rare. It might be 
argued that, in practice, important effect modifiers 
will be known and observed as additional covariates, 
so that once these are taken into account, only negligible
interactions with the unobserved confounders
remain, but by definition this cannot be verified
empirically. Note that any justification for the absence
of effect modification has to take the chosen mea- 
surement scale into account. Due to the increased 
bias we have seen in our numerical study, we would 
therefore recommend that practical applications of 
IV methods be complemented by some sensitivity 
analyses, especially with regard to such interactions 
in the model for the outcome Y. Moreover, these
considerations apply equally to continuous outcomes, which
are often analyzed without question using linear
no-interaction models.

The particularly restrictive assumptions underly- 
ing the WaldRR (and WaldOR) raise serious con- 
cern about how to handle situations where we do 
not have joint information on all the relevant vari- 
ables, such as in most meta-analyses, rendering the 
multiplicative SMM estimator inapplicable. The lin- 
ear IV estimator could, in principle, be applied, as 
it too does not require joint data and is not as badly 
biased, but for binary disease outcomes, risk differ- 
ences are rarely reported. In most applications the 
exposure is continuous and robustness of the non- 
linear Wald estimators to violations in those cases 
remains to be investigated. It certainly does not 
seem advisable to dichotomize a continuous expo- 
sure. 

We have only considered the asymptotic bias of 
the various estimators. In practice, their efficiency 
will also be of major concern. It is well known that 
IV estimators have larger variance than the naive 
estimators when there is no unobserved confound- 
ing. The variance, unlike the bias, very much de- 
pends on the strength of the instrument, but when 
there is strong confounding, it is impossible to find a 
strong instrument [7, 45]. The SMM estimators, derived
from estimating equations, can be made semiparametrically
efficient by choosing appropriate
weights in these equations [58]. Some methods for 
improving the efficiency of the Wald type relative 
risk have been proposed [47]. Further comparisons 
of properties and sampling behavior of IV estima- 
tors for the special case of a binary outcome can be 
found in [3, 13]. 

Another important issue that we have not ad- 
dressed here is that of measurement error. Theo- 
retically, it is not a problem if the IV is affected by 
measurement error, as long as this is not differential. 
If the exposure is affected by measurement error, we 
can still use the IV approach to test for a causal effect.
However, all the above IV estimators are then
expected to be biased, as core condition 3 is likely
to be violated when X is the measured, and not the 
true, exposure. In that case, we have to make even 
more modeling assumptions, namely, about the spe- 
cific measurement error process, in order to obtain 
valid point estimates [68]. 

APPENDIX 

Justification of LIVAE 

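We have established that the ACE is equal to
the model parameter $\beta$ in model (8). Define
$\tilde{G} = G - E(G)$, then $E(Y\tilde{G}) = \mathrm{Cov}(Y, G)$.
With core condition 1 and model (8),

\[
\begin{aligned}
E(Y\tilde{G}) &= E_G E(Y\tilde{G} \mid G) \\
&= E_G\bigl(\beta E(X\tilde{G} \mid G) + \tilde{G} E(h(U))\bigr) \\
&= \beta E(X\tilde{G}).
\end{aligned}
\]

Hence, $\beta = \mathrm{Cov}(Y, G)/\mathrm{Cov}(X, G)$, which is the
LIVAE estimand.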
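
The identity above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it simulates data from a linear model of the form (8) with no X-U interaction and compares the ratio $\mathrm{Cov}(Y, G)/\mathrm{Cov}(X, G)$ with the naive regression slope. The function name `livae`, the parameter values, the choice $h(U) = 2U$ and the distributions are all assumptions made for illustration.

```python
import numpy as np

def livae(y, x, g):
    """Sample version of Cov(Y, G) / Cov(X, G)."""
    g_centered = g - g.mean()          # G-tilde = G - E(G)
    return np.mean(y * g_centered) / np.mean(x * g_centered)

rng = np.random.default_rng(0)
n = 200_000
beta = 0.5                             # hypothetical causal effect in model (8)

g = rng.binomial(2, 0.3, size=n)       # instrument, independent of U (core condition 1)
u = rng.normal(size=n)                 # unobserved confounder
x = 0.8 * g + u + rng.normal(size=n)   # exposure depends on G and U
y = beta * x + 2.0 * u + rng.normal(size=n)   # h(U) = 2U, no X-U interaction

iv_estimate = livae(y, x, g)
naive_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(f"LIVAE: {iv_estimate:.3f}, naive slope: {naive_slope:.3f}, true beta: {beta}")
```

In this setup the naive slope is pulled away from $\beta$ by the confounder, whereas the ratio stays close to it. The sketch says nothing about scenarios in which model (8) fails.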

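Risk ratios or odds ratios require estimation of the
intercept of (9) obtained as follows:

\[ \hat{\alpha} = E(Y) - \hat{\beta} E(X), \]

where $\hat{\beta}$ = LIVAE from above. Hence, the CRR and
COR are identified by

\[
\mathrm{LIVRR} := \frac{\hat{\alpha} + \hat{\beta}}{\hat{\alpha}}, \qquad
\mathrm{LIVOR} := \frac{(\hat{\alpha} + \hat{\beta})(1 - \hat{\alpha})}
{\hat{\alpha}(1 - \hat{\alpha} - \hat{\beta})}.
\]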
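
As a small worked illustration, the two ratios can be computed from $E(Y)$, $E(X)$ and the LIVAE slope. This sketch assumes a binary exposure, so that the intercept and the intercept plus slope are read as the risks under X = 0 and X = 1. The function name `liv_ratios` and the input numbers are hypothetical.

```python
def liv_ratios(mean_y, mean_x, beta_hat):
    """Return (alpha_hat, LIVRR, LIVOR) from E(Y), E(X) and the LIVAE slope."""
    alpha_hat = mean_y - beta_hat * mean_x           # intercept of the linear risk model
    risk0, risk1 = alpha_hat, alpha_hat + beta_hat   # implied risks under X = 0 and X = 1
    if not (0.0 < risk0 < 1.0 and 0.0 < risk1 < 1.0):
        raise ValueError("implied risks fall outside (0, 1)")
    liv_rr = risk1 / risk0
    liv_or = (risk1 * (1.0 - risk0)) / (risk0 * (1.0 - risk1))
    return alpha_hat, liv_rr, liv_or

# Made-up inputs, for illustration only.
alpha_hat, liv_rr, liv_or = liv_ratios(mean_y=0.16, mean_x=0.4, beta_hat=0.1)
print(alpha_hat, liv_rr, liv_or)
```

The range check is a design choice of this sketch: a linear risk model can imply risks outside (0, 1), in which case neither ratio is meaningful.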

Further, under the additive SMM (11) we have
by simple rearranging that $E(Y \mid X, G, do(X = 0)) =
E(Y - \beta_L X \mid X, G)$, where we use that $E(Y \mid X =
x, G, do(X = x)) = E(Y \mid X = x, G)$. The exclusion
restriction implies that $Y \perp\!\!\!\perp G \mid do(X = 0)$ (cf. Figure 2),
which induces an estimating equation to obtain
$\beta_L$ based on the moment condition
$E\{(Y - \beta_L X)\tilde{G}\} = 0$, where $\tilde{G} = G - E(G)$, as before. The
solution is again $\beta_L = \mathrm{Cov}(Y, G)/\mathrm{Cov}(X, G)$.

Justification of WaldRR 

In addition to the model assumptions expressed
in (12) and (13), we need (14), that is, the random
variable $\xi := X - E(X \mid G, U)$ has to satisfy
$\xi \perp\!\!\!\perp G \mid U$.
This is automatically satisfied when X has a normal
distribution with constant variance given (G, U), or
a variance that only depends on U. More generally,
this is satisfied when the model for X given (G, U)
is a location-scale family, where only the location
parameter depends on G, U, for example, the class of
(noncentral) t-distributions; any class that restricts
the support of the distributions it contains, like the
Bernoulli, will not typically satisfy this condition,
though.

Hence, by definition, we can write $X = \delta G + k(U) + \xi$.
Consider now a regression of Y on G alone and
substitute this expression for X:

\[
\begin{aligned}
E(Y \mid G = g)
&= E_U E_{X \mid G=g,U} E(Y \mid X, U) \\
&= E_U\bigl[\exp\{h(U)\} E_{X \mid G=g,U} \exp\{\gamma X\}\bigr] \\
&= E_U\bigl[\exp\{h(U)\}
   \cdot E_{\xi \mid G=g,U} \exp\{\gamma(\delta g + k(U) + \xi)\}\bigr] \\
&= \exp\{\gamma\delta g\} E_U\bigl[\exp\{h(U) + \gamma k(U)\}
   \cdot E_{\xi \mid G=g,U} \exp\{\gamma\xi\}\bigr] \\
&\overset{(*)}{=} \mathrm{const} \cdot \exp\{\gamma\delta g\},
\end{aligned}
\]

where (*) uses $\xi \perp\!\!\!\perp G \mid U$, so that
$E_{\xi \mid G=g,U} \exp\{\gamma\xi\}$
is constant in G. Hence, the coefficient of G in a log-linear
regression of Y on G is $\gamma\delta$. Furthermore, $\delta$ can
be recovered from a linear regression of X on G, as
the latter is independent of U. Thus, as stated in
Section 4.2, the CRR is identified by the WaldRR.
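
The argument can be illustrated with a short simulation. This is a sketch under assumptions that go beyond the text: G is binary, $k(U) = U$, $h(U) = 0.3U$, the residual $\xi$ is standard normal, and the conditional mean $\exp\{\gamma X + h(U)\}$ is used directly in place of a noisy outcome. All parameter values and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
gamma, delta = 0.5, 1.0                # hypothetical parameter values

g = rng.binomial(1, 0.5, size=n)       # binary instrument, independent of U
u = rng.normal(size=n)                 # unobserved confounder
xi = rng.normal(size=n)                # residual, independent of G given U
x = delta * g + u + xi                 # X = delta * G + k(U) + xi with k(U) = U
mean_y = np.exp(gamma * x + 0.3 * u)   # E(Y | X, U) with h(U) = 0.3 U

# With binary G, the log-linear coefficient of Y on G is the log ratio
# of the two group means, and the linear coefficient of X on G is the
# difference of the two group means.
log_rr_yg = np.log(mean_y[g == 1].mean() / mean_y[g == 0].mean())
slope_xg = x[g == 1].mean() - x[g == 0].mean()
wald_log_rr = log_rr_yg / slope_xg

print(f"log-linear coefficient: {log_rr_yg:.3f} (gamma * delta = {gamma * delta})")
print(f"Wald-type log relative risk: {wald_log_rr:.3f} (gamma = {gamma})")
```

Replacing the normal residual by one whose distribution depends on G given U would break the step marked (*), and the ratio would then in general no longer recover $\gamma$.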

Justification of MSMMRR 

Analogously to the argument for the additive SMM,
we have by simple rearranging that

\[
E(Y \mid X, G, do(X = 0)) = E(Y \exp(-\gamma_L X) \mid X, G). \qquad (20)
\]

The exclusion restriction $Y \perp\!\!\!\perp G \mid do(X = 0)$ now
induces an estimating equation to obtain $\gamma_L$ based
on the moment condition $E(Y \exp(-\gamma_L X)\tilde{G}) = 0$,
where still $\tilde{G} = G - E(G)$. Due to the nonlinearity
of the exponential function, this does not have a simple
closed form solution as in the linear case, except
for binary variables as shown next.

When G is binary, the exclusion restriction implies
that $E(Y \mid G = 1, do(X = 0)) = E(Y \mid G = 0, do(X = 0))$.
By averaging over X,

\[
E(Y \exp(-\gamma_L X) \mid G = 1) = E(Y \exp(-\gamma_L X) \mid G = 0).
\]

When X and Y are binary as well, we obtain that
$E(Y \exp(-\gamma_L X) \mid G)$ is equal to
$E(YX \exp(-\gamma_L) \mid G) - E(YX \mid G) + E(Y \mid G)$.
Hence, we can rearrange the above equality to give (16).
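
For the all-binary case, the rearrangement can be sketched in a few lines. Substituting the expansion above into the equality for G = 1 and G = 0 and solving gives $\exp(-\gamma_L) = 1 - \{E(Y \mid G = 1) - E(Y \mid G = 0)\}/\{E(YX \mid G = 1) - E(YX \mid G = 0)\}$. This form follows from the two displayed statements; that it coincides with the way (16) is written in the main text is an assumption. Function names and input moments are hypothetical.

```python
import math

def msmm_gamma_binary(ey_g1, ey_g0, eyx_g1, eyx_g0):
    """Solve E(Y exp(-gamma X) | G=1) = E(Y exp(-gamma X) | G=0) for binary G, X, Y."""
    d_y = ey_g1 - ey_g0        # E(Y | G=1) - E(Y | G=0)
    d_yx = eyx_g1 - eyx_g0     # E(YX | G=1) - E(YX | G=0)
    if d_yx == 0.0:
        raise ValueError("E(YX | G) does not vary with G")
    exp_neg_gamma = 1.0 - d_y / d_yx
    if exp_neg_gamma <= 0.0:
        raise ValueError("no real solution for gamma")
    return -math.log(exp_neg_gamma)

def adjusted_mean(ey, eyx, gamma):
    """E(Y exp(-gamma X) | G) = E(YX | G) exp(-gamma) - E(YX | G) + E(Y | G)."""
    return eyx * math.exp(-gamma) - eyx + ey

# Made-up conditional moments, for illustration only.
gamma_l = msmm_gamma_binary(ey_g1=0.30, ey_g0=0.25, eyx_g1=0.20, eyx_g0=0.10)
lhs = adjusted_mean(0.30, 0.20, gamma_l)   # G = 1
rhs = adjusted_mean(0.25, 0.10, gamma_l)   # G = 0
print(gamma_l, lhs, rhs)
```

The two adjusted means agree at the solved value, which is exactly the moment condition for a binary instrument.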



Under additional assumptions, the ACE and COR
are also identified in an MSMM. First, we note that
by integrating out first G and then X from (20), we
obtain an expression for $E(Y \mid do(X = 0))$ as

\[
e^{-\gamma_L} E(Y \mid X = 1) P(X = 1) + E(Y \mid X = 0) P(X = 0).
\]

If we assume that the Y-X relative risk is the same
within subgroups of U as in model (12), then $\exp(\gamma_L)$
is also the (population) CRR (cf. also next section).
Thus, by substituting, we now obtain an expression
for $E(Y \mid do(X = 1))$ as

\[
E(Y \mid X = 1) P(X = 1) + e^{\gamma_L} E(Y \mid X = 0) P(X = 0).
\]

From these it is straightforward to obtain the estimands
that identify the ACE or COR by replacing
$\gamma_L$ by the negative log of (16).
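
The two expressions can be combined as follows. This is a minimal sketch for binary X and Y, assuming that $\exp(\gamma_L)$ is also the population CRR as stated above; the function name `msmm_population_effects` and the inputs are hypothetical.

```python
import math

def msmm_population_effects(ey_x1, ey_x0, p_x1, gamma_l):
    """Return E(Y|do(X=0)), E(Y|do(X=1)), ACE and COR for binary X and Y."""
    p_x0 = 1.0 - p_x1
    risk_do0 = math.exp(-gamma_l) * ey_x1 * p_x1 + ey_x0 * p_x0   # E(Y | do(X = 0))
    risk_do1 = ey_x1 * p_x1 + math.exp(gamma_l) * ey_x0 * p_x0    # E(Y | do(X = 1))
    ace = risk_do1 - risk_do0                                     # causal risk difference
    cor = (risk_do1 * (1.0 - risk_do0)) / (risk_do0 * (1.0 - risk_do1))
    return risk_do0, risk_do1, ace, cor

# Made-up inputs, for illustration only.
risk_do0, risk_do1, ace, cor = msmm_population_effects(
    ey_x1=0.30, ey_x0=0.10, p_x1=0.4, gamma_l=math.log(2.0)
)
print(risk_do0, risk_do1, ace, cor)
```

By construction the ratio of the two interventional risks equals $\exp(\gamma_L)$, which gives a quick consistency check on any implementation.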

Relations Between Assumptions 

Under the IV conditions the linear model (8) implies
the additive SMM (11). As $E(Y \mid X = x, U = u) =
E(Y \mid do(X = x), U = u) = \beta x + h(u)$, with definition
of X from Section 3.1,

\[
E(Y \mid X = x, G = g, do(X = x))
= \beta x + E(h(U) \mid G = g, X = x)
\]

and, hence,

\[
E(Y \mid X = x, G = g, do(X = x))
- E(Y \mid X = x, G = g, do(X = 0)) = \beta x,
\]

which is an additive SMM.

It can be shown analogously that the log-linear 
model (12) implies the MSMM (15). In each case 
the reverse is not true, as discussed by Hernán and
Robins [32] for the special case where all variables 
are binary. 

Further, the structural equation model (10) implies
model (8) and hence (11). The former states
that the potential responses of a generic individual
are given as $Y^i(x) = \beta_I x + \varepsilon^i$, where $\varepsilon^i$ is fixed for
the individual but not between individuals. Hence,
across the population $E(Y(x) \mid U = u) = \beta_I x + E(\varepsilon \mid U = u)$.
Interpreting $E(Y(x) \mid U = u)$ as $E(Y \mid do(X = x), U = u)$
and using (5), we obtain $E(Y \mid X = x, U = u) =
\beta_I x + h(u)$, which is equivalent to (8). The reverse
is clearly not true as counterexamples are easy to
construct.